What to look at
What distributions look like
How to measure
How to investigate
PyData Cluj-Napoca, meetup #13, 2020.05.28
What are we measuring?
Gil Tene - "How NOT to Measure Latency"
WhatsApp (2018)
~500M daily active users
Assuming i.i.d. delays and an average of 65/0.5 = 130 requests per user, 99th-percentile slowness will affect 1 - 0.99^130 ≈ 73% of the users
Depending on the number of requests per client:
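The back-of-the-envelope math above can be sketched in a few lines; the request count is the only assumed input:

```python
# With i.i.d. delays, the chance that a user making n requests sees at
# least one request slower than the 99th percentile is 1 - 0.99**n.

def p_affected(n_requests: int, percentile: float = 0.99) -> float:
    """Probability that at least one of n requests falls in the slow tail."""
    return 1.0 - percentile ** n_requests

# 65 / 0.5 = 130 requests per user, as in the example above
print(f"{p_affected(130):.0%}")  # roughly 73% of users affected
```

The takeaway: the more requests a session makes, the more the tail, not the median, determines what users experience.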
Throughput
Comparing messaging systems' performance
Example 1: processing times
Example 2: roundtrip times, single consumer, localhost, constant small processing time
Example 3: roundtrip times, single consumer, internet (short distance), constant small processing time
Example 4: roundtrip times, single consumer, internet (long distance), no processing time (instant reply)
Example 5: roundtrip times, single consumer, localhost, heavy server-side calculations based on a random uniform input
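A minimal sketch of how roundtrip times like those in the localhost examples can be sampled. This uses an in-process `socketpair` echo (both ends in one thread), so it measures syscall and copy costs rather than a real network hop; the helper name and sample count are illustrative:

```python
import socket
import time

def measure_roundtrips(n: int = 1000) -> list:
    """Measure n roundtrip times (seconds) over a local socket pair."""
    client, server = socket.socketpair()
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        client.sendall(b"ping")
        server.recv(16)          # "server" side: receive the request...
        server.sendall(b"pong")  # ...and reply immediately
        client.recv(16)
        samples.append(time.perf_counter() - t0)
    client.close()
    server.close()
    return samples

rtts = sorted(measure_roundtrips())
print(f"p50={rtts[len(rtts) // 2] * 1e6:.1f}us  "
      f"p99={rtts[int(len(rtts) * 0.99)] * 1e6:.1f}us")
```

For the internet examples, the same loop would wrap a real request/reply exchange; only the transport changes.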
log everything (if possible)
HdrHistogram
add timestamps to packets and chain them
use high resolution timers (if suitable)
build time series with timestamps, counts, events
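For logged samples, a tail summary can be computed with the standard library alone; HdrHistogram is the purpose-built choice for high-volume online recording, but this stdlib sketch (with synthetic, assumed latencies) shows the idea:

```python
import random
import statistics

# Hypothetical latency samples (ms); in practice these come from your logs
random.seed(42)
latencies = [random.lognormvariate(1.0, 0.5) for _ in range(10_000)]

# statistics.quantiles with n=100 yields the 1st..99th percentile cut points
cuts = statistics.quantiles(latencies, n=100)
p50, p90, p99 = cuts[49], cuts[89], cuts[98]
print(f"p50={p50:.2f}ms  p90={p90:.2f}ms  p99={p99:.2f}ms")
```

Report percentiles, not averages: the average hides exactly the tail you care about.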
https://colin-scott.github.io/personal_website/research/interactive_latency.html
Original source: http://norvig.com/21-days.html#answers
https://www.python.org/dev/peps/pep-0418/
measure on the same machine (when possible)
use time synchronization services (NTP, Chrony, etc.)
triangulate different sources
sample times with a ping-like service
account for clock desynchronization
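Clock desynchronization can be estimated NTP-style from one ping-like exchange; a sketch with hypothetical timestamps:

```python
def estimate_offset(t0: float, t1: float, t2: float, t3: float) -> tuple:
    """NTP-style clock offset and roundtrip delay estimation.

    t0: request sent (client clock)   t1: request received (server clock)
    t2: reply sent (server clock)     t3: reply received (client clock)
    """
    delay = (t3 - t0) - (t2 - t1)            # time actually spent in transit
    offset = ((t1 - t0) + (t2 - t3)) / 2.0   # server clock minus client clock
    return offset, delay

# Hypothetical timestamps: server clock 5 ms ahead, ~2 ms each way
offset, delay = estimate_offset(100.000, 100.007, 100.008, 100.005)
print(f"offset={offset * 1000:.1f}ms  delay={delay * 1000:.1f}ms")
# offset=5.0ms  delay=4.0ms
```

The offset estimate assumes symmetric paths; asymmetric routes bias it, which is one more reason to triangulate sources.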
"How?": measure
compare distribution tail over time
know your throughput; improve it; scale horizontally, or collapse when that's not possible
offline profiling reveals only some of the problems; make sure you also collect enough info in prod logs
extract critical scenarios from prod logs; run stress tests vs. historical replays
your system is not a black box, you can debug it: find the factors that change your latencies
build time series with factors throughout the day
assess their importance
find the dimensions on which delays cluster
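Finding the dimension on which delays cluster can start as a simple group-by over tagged samples; the records and factor values below are hypothetical:

```python
from collections import defaultdict

# Hypothetical log records: (factor value, latency in ms); the factor could
# be route, region, payload-size bucket, time of day, ...
records = [
    ("cache_hit", 2.1), ("cache_hit", 1.9), ("cache_hit", 2.4),
    ("cache_miss", 45.0), ("cache_miss", 52.3), ("cache_miss", 48.7),
]

by_factor = defaultdict(list)
for factor, latency in records:
    by_factor[factor].append(latency)

# Compare the tail per group: a large gap points at the splitting dimension
for factor, samples in sorted(by_factor.items()):
    samples.sort()
    print(f"{factor:>10}: median={samples[len(samples) // 2]:.1f}ms  "
          f"max={samples[-1]:.1f}ms")
```

If one factor value dominates the slow group, that dimension explains the cluster and tells you where to look next.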
prioritize consumer tasks
sources of delays are stochastic processes; try to identify and understand them
split delays into multiple transport and processing stages
reduce the number of roundtrips needed to build a result
target the critical path and historical or theoretical cases of failure
check outliers, major failures may have a chain of things that went wrong
multimodal distributions are a sign of multiple paths, try to identify the split factor
watch for drifts from parametric distributions
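A drift from a parametric model can be flagged by comparing the tail predicted by the fitted model against the empirical tail; a sketch with synthetic, assumed data containing a hidden second mode:

```python
import math
import random

random.seed(7)
mean_ms = 10.0
# Hypothetical samples: mostly exponential service times, plus a slow mode
samples = [random.expovariate(1 / mean_ms) for _ in range(9_500)]
samples += [100 + random.expovariate(1 / mean_ms) for _ in range(500)]

# Under a pure exponential model fitted by the mean, p99 = -mean * ln(0.01)
fitted_mean = sum(samples) / len(samples)
model_p99 = -fitted_mean * math.log(0.01)
empirical_p99 = sorted(samples)[int(len(samples) * 0.99)]

# A large gap between model and data flags the drift (here: a second mode)
print(f"model p99={model_p99:.1f}ms  empirical p99={empirical_p99:.1f}ms")
```

When the empirical tail runs far above the fitted one, the parametric assumption no longer holds and a split factor is worth hunting for.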